Optimized network clustering by jumping sub-optimal dendrograms

We propose a method to improve community division techniques in
networks that are based on agglomeration by introducing dendrogram
jumping. The method is based on iterations of sub-optimal dendrograms
instead of optimization of each agglomeration step. We find the
algorithm to exhibit excellent scaling behavior of its computational
complexity. In its present form the algorithm scales as $O(N^2)$, but
by using more efficient data structures it is possible to achieve a
scaling of $O(N \log^2 N)$. We compare our results with other methods
such as the greedy algorithm and the extremal optimization method. We
find modularity values larger than the greedy algorithm and values
comparable to the extremal optimization method.



I. INTRODUCTION 

The study of communities in networks has received considerable
attention recently. Generally, a community can be thought of as a
subset of nodes of the network in which the nodes are more densely
connected among each other than they are to the other nodes in the
network. By analyzing a network in terms of its communities, it is
possible to gain understanding of the structure of a network on a
larger scale and to uncover previously unnoticed connections between
nodes or groups of nodes. Examples of successful community division
studies include a study on the relationship between diseases and
genes [1], the identification of transition states in potential energy
landscapes [2], and the identification of recording locations and
racial community structures in a jazz musicians network in the USA
around 1920 [3]. We have recently used our modularity optimization
algorithm to optimize the performance of a recursive inverse
factorization technique used in large scale electronic structure
calculations [4].

The analysis of a network in terms of communities poses a difficult
challenge. The clustering algorithm has to be accurate so that it
identifies informative community structures. This implies that the
algorithm has to consider many of the possible community divisions in
the network before it can decide which one is the best. From a
computational point of view we are faced with a problem that grows
rapidly with network size. Considering all possible community
divisions becomes computationally infeasible even for relatively small
networks. In fact, it can be shown that the number of ways of dividing
a network into communities grows as the Stirling number of the second
kind [5].

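To give a feeling for this growth, the following minimal sketch counts the community divisions of a small network with the standard recurrence for Stirling numbers of the second kind. The function names `stirling2` and `count_divisions` are illustrative choices and do not come from the original work.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Ways to split n labelled nodes into exactly k non-empty communities."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Node n either joins one of the k existing communities or starts a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def count_divisions(n):
    """Total number of community divisions of n nodes, summed over all C."""
    return sum(stirling2(n, k) for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (5, 10, 34):
        print(n, count_divisions(n))
```

Already for ten nodes there are 115975 possible divisions, which illustrates why exhaustive enumeration is not an option.
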
In order to quantify the quality of a community division, a quality
function is introduced that assigns a "value-reflecting" number, or
"quality-of-split" [6], to a community division. The modularity, as
introduced by Girvan and Newman [7], is a popular choice for such a
quality function. Although there may be some drawbacks to this
approach, as pointed out by Fortunato and Barthélemy [8], the
community divisions that are obtained by optimizing the modularity
typically give valuable information.

In the literature many different optimization strategies can be found
that employ the modularity. They vary in quality, i.e. the value of
the largest modularity they find, and in the computational effort. An
efficient agglomerative method is the greedy algorithm of Newman [5],
which Clauset et al. [9] showed to run in $O(N \log^2 N)$
computational effort. Other methods, however, find modularity values
larger than the greedy algorithm. These include extremal optimization
[10], basin-hopping [2], simulated annealing [11], recursive
filtration [12], a heuristic algorithm [13], and a spectral algorithm
[14].

In this article we introduce a new agglomerative method which we
demonstrate by using the modularity as the quality-of-split function.
Our method is general, however, and can be used with any other
quality-of-split function. We find modularity values comparable to the
extremal optimization method for a set of well-studied networks. In
section II we summarize the agglomerative greedy algorithm and
introduce our method. In section III we compare with results for some
popular networks using the greedy method and the extremal optimization
method. Finally, in section IV we present our conclusions.



II. THEORY

A network of N nodes can be divided into any number of communities,
C, where $1 \le C \le N$. The extremal cases C = 1 and C = N are the
two trivial solutions in which all nodes either belong to only one
common community or each belongs only to its own separate community,
respectively. Given that the number of possible community divisions is
exceedingly large for any decently sized network, the problem of
finding the optimal community split cannot be approached by
calculating all possible splits. Instead one has to resort to
approximate solutions of the problem. One possibility is to attempt to
find the optimal community division by starting with one of the two
trivial cases and proceeding by either stepwise merging two
communities or by splitting a community into two until the opposite
extremal case is reached. These two approaches are commonly referred
to as agglomerative and divisive methods, respectively. The set of
community divisions for every value of C in the range
$1 \le C \le N$ is called a dendrogram. Such dendrogram-based
optimization methods aim to find the best community split by
optimizing each step along the dendrogram based on the change in the
quality-of-split measure. These methods are typically very efficient
computationally, but the quality of their result depends critically on
the heuristic chosen during the stepwise optimization process. In
fact, in a recent comparison of dendrogram-based methods and a
simulated annealing technique, Danon et al. [17] found that the
simulated annealing method, which is not bound to a dendrogram, is
able to find better solutions to the community division problem than
the dendrogram-based methods.

The quality-of-split function we will use for this study is the
modularity, Q, of Girvan and Newman [7]. It is defined as

$$Q = \mathrm{Tr}(e) - \sum_{ij} (e^2)_{ij}, \qquad (1)$$

where e is the assortative mixing matrix (which is a C x C matrix).
The elements $e_{ij}$ are given by the number of links from community
i to j for a particular community division as a fraction of the total
number of links.
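
As an illustration of eq. (1), the sketch below builds the mixing matrix for a given community assignment and evaluates Q. It assumes links are supplied as directed pairs (an undirected link is listed in both directions), and it uses the identity that the sum of all elements of $e^2$ equals the sum over communities of column sum times row sum. The helper names `mixing_matrix` and `modularity` are illustrative, not taken from the original work.

```python
def mixing_matrix(edges, community, n_comm):
    """e[i][j]: fraction of all links that run from community i to community j."""
    e = [[0.0] * n_comm for _ in range(n_comm)]
    for u, v in edges:
        e[community[u]][community[v]] += 1.0 / len(edges)
    return e


def modularity(e):
    """Q = Tr(e) - sum_ij (e^2)_ij, evaluated through column and row sums."""
    c = len(e)
    col = [sum(e[r][k] for r in range(c)) for k in range(c)]
    row = [sum(e[k]) for k in range(c)]
    return sum(e[k][k] for k in range(c)) - sum(col[k] * row[k] for k in range(c))


if __name__ == "__main__":
    # Toy network (an assumption for illustration): two triangles joined by one link.
    undirected = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    edges = undirected + [(v, u) for u, v in undirected]
    community = [0, 0, 0, 1, 1, 1]
    print(round(modularity(mixing_matrix(edges, community, 2)), 4))  # 0.3571
```

For this toy network the division into the two triangles gives Q = 5/14, while placing all nodes in one community gives Q = 0.
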

The greedy algorithm was introduced by Newman [5] and is a good
example of an agglomerative method. A range of other agglomerative
methods have been proposed which differ in how each step in the merge
process is chosen. The change in modularity due to merging two
communities i and j can be written as [5]

$$\Delta Q = e_{ij} + e_{ji} - a_i b_j - a_j b_i, \qquad (2)$$

where $a_i$ and $b_i$ are the column and row sums of e, respectively.
Each merge is chosen so as to maximize the resulting modularity change
in the hope that this leads to the community division with the maximum
modularity. What makes the greedy method particularly attractive is
the fact that eq. (2) is inexpensive to evaluate (it is of $O(1)$
computational effort). In addition, the number of community merges to
evaluate is at most $(N - 1)(N - 2) \cdots$, which leads to an overall
computational complexity of $O(N \log^2 N)$.
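
Eq. (2) can be checked numerically against a direct recomputation of Q. The following sketch does so for a small hand-written mixing matrix. The matrix values and the helper names are assumptions made for illustration. The merge routine copies the matrix for clarity and is not meant to reflect an optimized implementation.

```python
def column_row_sums(e):
    """Return (a, b): the column sums a_i and the row sums b_i of e."""
    c = len(e)
    a = [sum(e[r][k] for r in range(c)) for k in range(c)]
    b = [sum(e[k]) for k in range(c)]
    return a, b


def modularity(e):
    a, b = column_row_sums(e)
    return sum(e[k][k] - a[k] * b[k] for k in range(len(e)))


def delta_q(e, a, b, i, j):
    """Eq. (2): constant-time modularity change when merging communities i and j."""
    return e[i][j] + e[j][i] - a[i] * b[j] - a[j] * b[i]


def merge(e, i, j):
    """Return a new mixing matrix with community j folded into community i."""
    c = len(e)
    m = [row[:] for row in e]
    for k in range(c):
        m[i][k] += m[j][k]  # add row j to row i
    for k in range(c):
        m[k][i] += m[k][j]  # add column j to column i
    keep = [k for k in range(c) if k != j]
    return [[m[r][s] for s in keep] for r in keep]


if __name__ == "__main__":
    e = [[0.30, 0.05, 0.05],
         [0.05, 0.20, 0.05],
         [0.05, 0.05, 0.20]]
    a, b = column_row_sums(e)
    print(delta_q(e, a, b, 0, 1), modularity(merge(e, 0, 1)) - modularity(e))
```

Both printed numbers agree up to rounding, which confirms that the four terms of eq. (2) capture the full change in Q when a and b are kept up to date.
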



A. Sub-optimal iterations 

Optimization techniques which are bound to a dendrogram generally find
smaller values of the modularity than methods which do not operate
along a dendrogram. This is due to the fact that the optimal solution
may not be accessible by walking a stepwise optimized dendrogram.
Optimal choices in the beginning of the dendrogram (large C for
agglomerative methods) may lead to community divisions which cannot be
merged further to achieve the optimal division. This is of course also
a problem for divisive methods. It may therefore not be possible to
find a heuristic that finds the optimal community division along one
single dendrogram. Motivated by our previous work on the modularity
density of networks [20], we avoid the problem by performing
incomplete optimizations in each merge process. We do not consider all
C(C − 1) possible merges, but only a smaller randomly chosen subset,
and pick the merge with the largest modularity increase from this
subset. This procedure has the advantage that it allows for randomness
in the agglomeration process so that two different runs will not give
the same result. The randomness is obviously also a drawback since it
is very unlikely that we find the optimal community split in one such
random dendrogram. Even several sub-optimal dendrograms are unlikely
to produce the optimal community division. We therefore iterate over
several random dendrograms in an inner loop and optimize a list of
modularity values for each value of C in an outer loop. Our algorithm
is expressed as pseudocode in figure 1.

Inside the inner loop we analyze a "somewhat" random dendrogram. By
this we mean that we pick the best out of n randomly chosen possible
merge processes in each step of the dendrogram (line 4 in figure 1).
Clearly, in the limit n → 1, the dendrogram is random. The greedy
method, on the other hand, corresponds to at most n = C(C − 1)
different merge processes. For any value n < C(C − 1) we achieve
sub-optimal agglomeration. As we decrement C and walk the dendrogram
from larger to smaller C values, the merge process with the largest
modularity gain becomes the proposed merge process. We store a list of
the best modularities found so far for each value of C and merge along
the proposed merge process only if the proposed merge produces a
higher modularity value than the previous best value. If the proposed
merge process produces a modularity value equal to or less than the
best modularity value thus far, we
discard it and load the community division corresponding to the best
modularity found so far to continue. It is this step (lines 6 through
12 in figure 1) which leads to a dendrogram jump. The list of best
modularity values of the inner loop usually converges rather quickly,
after which the algorithm cannot improve on the modularities any
longer.

 1: for N^outer times do {Outer loop}
 2:   for N^inner times do {Inner loop}
 3:     for C = N to C ≥ 2 do {Suboptimal agglomeration}
 4:       Find n random merge candidates.
 5:       Calculate ΔQ_n(C → C − 1) from eq. (2).
 6:       if max(Q^inner_C + ΔQ_n) > Q^inner_{C−1} then {Accept proposed merge process}
 7:         e^inner_{C−1} ⇐ e
 8:         Q^inner_{C−1} ⇐ Q^inner_C + ΔQ_n
 9:         Calculate new e.
10:       else {Reject proposed merge process}
11:         e ⇐ e^inner_{C−1}
12:       end if
13:     end for {Suboptimal agglomeration}
14:   end for {Inner loop}
15:   for C = 1 to C ≤ N do {Update list of modularity values}
16:     if Q^inner_C > Q^outer_C then
17:       Q^outer_C ⇐ Q^inner_C
18:     end if
19:   end for {Update list of modularity values}
20: end for {Outer loop}

FIG. 1: Sub-Optimal Dendrogram Jumping Algorithm: (line 3) This loop
performs the suboptimal agglomeration. In this study Q means the
modularity, but any quality-of-split function can be used. (line 2)
The inner loop restarts the agglomeration process. We improve on the
best modularities found by accepting proposed merge processes only if
they lead to higher modularity values. (line 1) The best modularity
values found in the inner loop are stored and the inner loop is
restarted.

In the outer loop we store another list of best modular- 
ity values. Once the inner loop completes we update the 
list of the outer loop with the values found in the inner 
loop (lines 15 through 19 in figure [1]). We then reini- 
tialize the inner loop for another run. This step allows 
the algorithm to explore a different part of the commu- 
nity division space since it will randomly choose other 
dendrograms. 

The 3 tunable parameters in our algorithm, n, N^inner, and N^outer,
are chosen according to the following heuristic: (1) The number of
suboptimal trials, n, should be significantly smaller than the greedy
limit, C(C − 1). Although our method will work with any n ≥ 1, we have
found in practice that any value in the range 5 ≤ n ≤ 10 works well.
(2) The number of inner loops, N^inner, should be chosen as small as
possible such that the Q^inner list is converged. This process is
exemplified by the convergence of the blue dashed lines in the upper
panel of figure 2. The value of this particular parameter depends
strongly on the network size. We have found values of N^inner between
8 and 150 for the smaller networks (Zachary Karate Club) and the
larger networks (Jazz musician and e-mail network), respectively, to
work well. (3) There is no predetermined limit on the number of outer
loops; N^outer is determined by the level of convergence desired for
the maximum modularity value. The more outer loops are performed, the
more likely it is to find the maximum modularity value.
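
The procedure described above can be sketched in a few lines of Python. This is one possible reading of the pseudocode in figure 1, not the authors' implementation: the function names, the dense list-of-lists mixing matrix, the default parameter values and the seeded random generator are all assumptions made for illustration, and the matrix copies make each merge more expensive than in an optimized code.

```python
import random


def modularity(e):
    """Q = Tr(e) - sum_ij (e^2)_ij, via column sums a and row sums b."""
    c = len(e)
    a = [sum(e[r][k] for r in range(c)) for k in range(c)]
    b = [sum(row) for row in e]
    return sum(e[k][k] - a[k] * b[k] for k in range(c))


def merge(e, members, i, j):
    """Return a new (e, members) pair with community j folded into community i."""
    c = len(e)
    m = [row[:] for row in e]
    for k in range(c):
        m[i][k] += m[j][k]  # add row j to row i
    for k in range(c):
        m[k][i] += m[k][j]  # add column j to column i
    keep = [k for k in range(c) if k != j]
    new_e = [[m[r][s] for s in keep] for r in keep]
    new_members = [members[k] + members[j] if k == i else members[k] for k in keep]
    return new_e, new_members


def dendrogram_jumping(edges, n_nodes, n_trials=5, n_inner=8, n_outer=20, seed=0):
    """Return {C: (Q, communities)} with the best division found for every C."""
    rng = random.Random(seed)
    e0 = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:  # directed links; list undirected links in both directions
        e0[u][v] += 1.0 / len(edges)
    start = (modularity(e0), e0, [[v] for v in range(n_nodes)])
    best_outer = {}
    for _ in range(n_outer):
        best_inner = {n_nodes: start}  # C -> (Q, e, members)
        for _ in range(n_inner):
            q, e, members = start
            for c in range(n_nodes, 1, -1):  # walk one sub-optimal dendrogram
                a = [sum(e[r][k] for r in range(c)) for k in range(c)]
                b = [sum(row) for row in e]
                # Best of n_trials random merge candidates, scored with eq. (2).
                gain, i, j = max(
                    (e[i][j] + e[j][i] - a[i] * b[j] - a[j] * b[i], i, j)
                    for i, j in (rng.sample(range(c), 2) for _ in range(n_trials))
                )
                if c - 1 not in best_inner or q + gain > best_inner[c - 1][0]:
                    e, members = merge(e, members, i, j)  # accept the proposal
                    q = q + gain
                    best_inner[c - 1] = (q, e, members)
                else:  # reject: jump to the best known division with C - 1 communities
                    q, e, members = best_inner[c - 1]
        for c, (q, _, members) in best_inner.items():
            if c not in best_outer or q > best_outer[c][0]:
                best_outer[c] = (q, members)
    return best_outer


if __name__ == "__main__":
    undirected = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    edges = undirected + [(v, u) for u, v in undirected]
    for c, (q, members) in sorted(dendrogram_jumping(edges, 6).items()):
        print(c, round(q, 4), members)
```

The sketch keeps the running modularity consistent with the current division by adding the gain of eq. (2) on acceptance and by reloading the stored value on rejection, so that the list indexed by C always refers to an actual community division.
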

Iterating over the inner and outer loop converges 
rapidly and we find a list of the best modularity val- 
ues for all C . In the following section we will compare 
our results with previously studied networks. 



III. RESULTS 

In the upper panel of figure 2 we show the best modularity values
found by our iterative dendrogram jumping method for the unweighted
Zachary Karate Club network (black circles). This result was obtained
with N^inner = 8 inner loop iterations and N^outer = 20 outer loop
iterations. The number of sub-optimal trials in each agglomeration was
n = 10. We indicate the borders between sections of the curve that
belong to the same dendrogram to illustrate the dendrogram jumping of
our algorithm. In the case shown, we find 12 such borders. We find the
maximum modularity at C = 4 at a value of Q = 0.4198. The blue dashed
lines show the evolution of an optimization in the inner loop
consisting of 8 sub-optimized Q(C) curves obtained by 10 merge trials
per C. We find that this particular run over the inner loop of the
dendrogram jumping method achieves high modularities for C = 3, 7, and
8 but fails to find the largest modularity that we find after looping
over the outer loop.

In the lower panel of figure [2] we show the first few 
branches of the resulting disconnected dendrogram for 
the overall optimization of the Zachary Karate Club net- 
work. At C = 1 , all nodes are in the same community as 
indicated by the single black circle. As this community 
is split into two and subsequently three communities, we 
find that the figure looks like an ordinary dendrogram. 
However, splitting 3 into 4 communities, the community 
division corresponding to the highest modularity value 
results in the merging of two communities and the split- 
ting of two communities. We indicate this merging by the 
first red circle. This is a situation that cannot occur in 
a dendrogram and thus indicates a dendrogram jump in 
the maximum modularity curve. Two more such events 
are marked with red circles. 
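
A dendrogram jump of this kind can be detected mechanically: within a single dendrogram, the division into C + 1 communities must be a refinement of the division into C communities. The following sketch tests that property for a list of best divisions. The function names and the toy divisions are assumptions made for illustration.

```python
def is_refinement(fine, coarse):
    """True if every community in `fine` lies entirely inside one community of `coarse`."""
    coarse_sets = [set(c) for c in coarse]
    return all(any(set(f) <= c for c in coarse_sets) for f in fine)


def find_jumps(divisions):
    """Return the C values at which going from C to C + 1 communities is not a pure split.

    `divisions` maps the number of communities C to a list of communities.
    """
    return [
        c
        for c in sorted(divisions)
        if c + 1 in divisions and not is_refinement(divisions[c + 1], divisions[c])
    ]


if __name__ == "__main__":
    divisions = {
        1: [[0, 1, 2, 3]],
        2: [[0, 1], [2, 3]],
        3: [[0], [1, 2], [3]],  # nodes 1 and 2 were separated at C = 2
    }
    print(find_jumps(divisions))  # [2]
```

In the toy example the step from two to three communities joins nodes that were separated before, which is the kind of event marked with a red circle in the lower panel of figure 2.
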

We have performed modularity optimizations by means of our improved
sub-optimal dendrogram jumping algorithm for a set of well-known
networks. In table I we present our results for the maximum modularity
and the number of communities that were obtained. The results are
compared to the corresponding maximum modularity found by the greedy
method and the extremal optimization method [10], where applicable. We
find that dendrogram jumping always finds modularity values larger
than the greedy algorithm and values comparable to the extremal
optimization method. In most cases the number of communities
corresponding to the largest modularity value differs from what was
found with the other methods. We never found a case in which the
communities had the same members in the different methods, even in the
cases where the number of communities was the same. For the small
networks the community members differed only in a few nodes. For the
larger networks, however, we found more significant differences in the
assignment of nodes to communities.

FIG. 2: Upper panel: The dashed lines (blue online) indicate the
maximum modularity values found in the inner loop. With each iteration
of the inner loop the maximum modularity increases and eventually
converges, shown with the bold dashed line (blue online). The circles
show the maximum modularity values found after the outer loop is
converged. The vertical lines indicate sections that belong to one
dendrogram. Dendrogram jumping occurs across the vertical lines. Lower
panel: A generalized dendrogram that corresponds to the maximum
modularity of the unweighted Zachary Karate Club network. Events that
indicate dendrogram jumps are marked with red circles.



The evaluation of ΔQ of eq. (2) can be done in $O(1)$ time. The
subsequent update of e after a merge operation takes $O(N)$ worst-case
time and there are N − 1 such merge operations per dendrogram. We
iterate through a fixed number of inner and outer loops, i.e. the
total computational effort is $O(N^2)$. Not surprisingly, this is
identical to the computational effort found for the greedy algorithm
[5] in the sparse graph limit. The inner loop of our algorithm is a
generalization of the greedy method in the limit of n → C(C − 1). In
our algorithm we only consider a small fixed number n of merge
candidates, which implies that our method always scales as $O(N^2)$,
even in the dense graph case. However, the data structures used by
Clauset et al. [9] to speed up the greedy algorithm are readily
applicable in our case as well, which makes it possible to reduce the
computational effort to $O(N \log^2 N)$. For comparison, the extremal
optimization method [10] runs in $O(N^2 \log N)$ time.
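
The operation count behind the $O(N^2)$ estimate can be written out explicitly. The following is a sketch that treats n, N^inner and N^outer as constants independent of N, as the paragraph above does. The symbol T(N) for the total effort is introduced here for illustration only.

```latex
T(N) = N^{\mathrm{outer}} N^{\mathrm{inner}}
       \sum_{C=2}^{N} \bigl[\, n \cdot O(1) + O(N) \,\bigr]
     = N^{\mathrm{outer}} N^{\mathrm{inner}} (N - 1)
       \bigl[\, n \cdot O(1) + O(N) \,\bigr]
     = O(N^2)
```

The first term in the bracket is the evaluation of eq. (2) for the n merge candidates and the second term is the update of e after a merge.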



We have successfully used our algorithm in its current 
implementation for networks of up to ~ 1200 nodes but 
networks an order of magnitude larger should be ana- 
lyzable with a current desktop workstation and sufficient 
memory installed. 



IV. CONCLUSIONS 



We have presented a generic optimization technique that applies to
community detection algorithms that are agglomerative. We demonstrated
the efficiency of our method by calculating the maximum modularity for
a set of networks and comparing our results with two other methods,
the greedy method by Newman and coworkers [5, 9] and the extremal
optimization method by Duch and Arenas [10]. In this comparison we
found modularity values for all examples studied that are larger than
the results of the greedy algorithm and comparable to the results of
the extremal optimization method. The computational complexity of our
method ultimately is $O(N \log^2 N)$. Our method therefore has the
same computational complexity scaling behavior as the greedy method.
The extremal optimization method is computationally more expensive and
scales with a computational complexity of $O(N^2 \log N)$. In
applications in which sufficient memory is available, our algorithm
therefore is the method of choice.

Our study showed that the community divisions of maximum modularity
are not connected by one single dendrogram and thus any method which
aims at optimizing the modularity by optimizing each step along a
dendrogram will fail. This finding confirms a similar conclusion drawn
by Medus et al. [19]. This has important implications for the further
development of modularity optimization methods.

Network                 Ref.        Nodes  Edges  Q^max   C   Q^greedy  C^greedy  Q^EO    C^EO
Zachary                 21, 22         34    156  0.4198   4  0.3807           3  0.4188     4
Zachary (W)             21, 22         34    156  0.4449   4  0.4345           3  -          -
Fraternities Subj. (W)  21, 23, 24     58   3306  0.0486   3  0.0412           3  -          -
Fraternities Obj. (W)   21, 23, 24     58   1934  0.1460   6  0.1408           6  -          -
Dolphins                25, 26         62    318  0.5285   5  0.4923           4  -          -
Prisoners               21, 27         67    182  0.6232   9  0.6217           9  -          -
Les Miserables (W)      21, 28         77    508  0.5667   6  0.5472           5  -          -
Les Miserables          21, 28         77    508  0.5600   6  0.5006           5  -          -
Grassland               29             88    274  0.6627   9  0.6609          10  -          -
Jazz bands              3             198   5484  0.4450   4  0.4389           4  0.4452     5
Littlerock              29            183   4886  0.3629   4  0.3395           3  -          -
Jazz musicians          3, 30        1265  76714  0.5780  18  0.5235          20  -          -
e-mail                  31, 32       1133  10902  0.5718  11  0.5093          15  0.5738    15

TABLE I: Our results for the optimized networks in this study compared
to the greedy algorithm and the extremal optimization method where
applicable. The entries labeled (W) are weighted networks. Shown are
the number of nodes in the network (Nodes), the number of directed
edges (Edges), the maximum modularity found by our method (Q^max), the
number of communities for this value of the modularity (C), the same
quantities for the greedy algorithm (Q^greedy and C^greedy), and for
the extremal optimization method (Q^EO and C^EO). A dash marks entries
for which no extremal optimization result is listed.